A transparent artificial intelligence framework to assess lung disease in pulmonary hypertension

Recent studies have recognized the importance of characterizing the extent of lung disease in pulmonary hypertension patients using computed tomography. The trustworthiness of an artificial intelligence system is linked to the depth of its evaluation along functional, operational, usability, safety and validation dimensions. The safety and validation of an artificial intelligence tool are linked to the uncertainty estimation of the model’s prediction. On the other hand, functionality, operation and usability can be achieved by explainable deep learning approaches, which can verify the learning patterns and use of the network from a generalized point of view. We developed an artificial intelligence framework to map the 3D anatomical models of patients with lung disease in pulmonary hypertension. To verify the trustworthiness of the framework, we studied the uncertainty estimation of the network’s prediction and explained the learning patterns of the network. To this end, a new generalized technique combining local explainable and interpretable dimensionality reduction approaches (PCA-GradCam, PCA-Shape) was developed. Our open-source software framework was evaluated on unbiased validation datasets, achieving accurate, robust and generalized results.

We developed an artificial intelligence framework to map the 3D anatomical models of patients with evidence of pulmonary hypertension. To this end, we studied a patch-oriented multi-classification task of pulmonary hypertension using different deep learning networks. To verify the trustworthiness of the framework, we studied the uncertainty estimation of the network's prediction and explained the learning patterns of the network. For this purpose, a new generalized technique that combines local explainable and interpretable dimensionality reduction approaches (PCA-GradCam, PCA-Shape) was developed. Our framework was evaluated in an unbiased validation scheme with 'seen' and 'unseen' datasets, achieving accurate, robust and generalized results in the 'unseen' dataset (91.83 ± 3.48 % Jaccard score, 5.96 ± 3.17 mm Hamming distance, 80.21 ± 7.83 % MCC, 93.69 ± 3.90 % accuracy, and 0.855 ± 0.40 root mean square error).

Overview
In this document we extend the study "A trustworthy artificial intelligence framework to assess lung disease in pulmonary hypertension" with additional information on the methodology implementation and the results (tables and figures). The document is structured as follows: Section 2, Methodology development, and Section 3, Results.

Methodology development
We developed a patch-oriented AI framework to map the 3D anatomical models of patients with lung disease in pulmonary hypertension. The framework has four steps: the extraction of the 3D patches, the classification of each patch, the volume rebuild of the 3D anatomical model, and the analysis and evaluation of the pulmonary hypertension profile of the lungs (S S1 manuscript).
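The third step, rebuilding the volume from the classified patches, can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `rebuild_label_volume`, the (row, column, slice) origin convention, the integer class labels and the background value of -1 are all assumptions made for the example.

```python
import numpy as np


def rebuild_label_volume(volume_shape, origins, labels, patch_size, background=-1):
    """Paint each patch's predicted class back into a volume-sized label map.

    origins    : list of (row, col, slice) corners of the extracted patches
    labels     : predicted class index for each patch (hypothetical encoding)
    patch_size : (height, width, depth) of every patch
    background : value kept where no patch was extracted (assumption)
    """
    label_volume = np.full(volume_shape, background, dtype=np.int16)
    ph, pw, pd = patch_size
    for (y, x, z), label in zip(origins, labels):
        label_volume[y:y + ph, x:x + pw, z:z + pd] = label
    return label_volume


# Toy usage: two 2x2x2 patches placed in a 4x4x2 volume.
toy_map = rebuild_label_volume((4, 4, 2), [(0, 0, 0), (2, 2, 0)], [1, 3], (2, 2, 2))
```

Because the patches are extracted without overlap, each voxel receives at most one label, so no voting or averaging between patches is needed in this sketch.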

Patch size parameter initialization
To extract patches of different sizes, we varied the width, height and depth (64x64x3, 32x32x3, 16x16x8, 8x8x3). To ensure that each patch included a sufficient amount of lung information, we required a minimum ratio of segmented lung area within each patch. This ratio was set to 80 %. We used a sliding window to extract the patches of each patient, without any overlap between windows.
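The extraction rule described above (non-overlapping sliding window, at least 80 % segmented lung per patch) can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `extract_patches`, the (rows, columns, slices) axis order and the binary `lung_mask` input are hypothetical and are not taken from the study's code.

```python
import numpy as np


def extract_patches(volume, lung_mask, patch_size=(64, 64, 3), min_lung_ratio=0.8):
    """Slide a non-overlapping window over the volume and keep lung-rich patches.

    volume         : 3D array ordered (rows, cols, slices)  (assumed layout)
    lung_mask      : boolean array of the same shape, True inside the lungs
    patch_size     : (height, width, depth) of the window
    min_lung_ratio : minimum fraction of lung voxels required in a patch
    """
    ph, pw, pd = patch_size
    n_rows, n_cols, n_slices = volume.shape
    patches, origins = [], []
    # A stride equal to the patch size gives windows that never overlap.
    for z in range(0, n_slices - pd + 1, pd):
        for y in range(0, n_rows - ph + 1, ph):
            for x in range(0, n_cols - pw + 1, pw):
                window = lung_mask[y:y + ph, x:x + pw, z:z + pd]
                if window.mean() >= min_lung_ratio:
                    patches.append(volume[y:y + ph, x:x + pw, z:z + pd])
                    origins.append((y, x, z))
    return patches, origins
```

Returning the patch origins alongside the patches keeps enough information to place each classified patch back into the volume later.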

Multi-classifiers
During training we did not use a cross-validation protocol because the cohort size was very large (over 5000 patches per class). Instead, we used a validation split of 70 % training and 30 % validation samples to guard against overfitting or bias in the training process.
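A 70 % / 30 % hold-out split of this kind can be sketched as follows. The text does not state whether the patches were shuffled or stratified by class, so the seeded shuffle and the function name `split_train_val` are assumptions made for this example.

```python
import random


def split_train_val(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples once and split them into training and validation sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded shuffle, reproducible (assumption)
    cut = int(round(train_fraction * len(samples)))
    train = [samples[i] for i in indices[:cut]]
    val = [samples[i] for i in indices[cut:]]
    return train, val


# Toy usage with 5000 patch identifiers for one class.
train_ids, val_ids = split_train_val(list(range(5000)))
```

Applying the split separately to each class would keep the class proportions identical in both subsets; that is a design choice of this sketch, not something the text confirms.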

Datasets and cohorts
The training, validation and testing datasets (internal and unseen cohorts) can be provided upon request to the senior author or the corresponding author.

Explanation of the network's learning patterns with the generalized technique
Two experts (AS and KD) visually inspected the results of Figures 6a, 6c and Figure 7. In Figure 7, each class shows the PCA-Shape and PCA-GradCam images for each PCA component (first, second, etc.). The PCA-Shape image is the PCA of the input patches of each class region. In most cases, blue in the PCA-Shape images corresponds to the lungs and green to other 'no interest' regions (other organs, bones, etc.). The colour scale of the PCA-GradCam images runs from blue to red (red for highly focused pixels, blue for pixels without focus).
The correlation coefficient of the two images gives the positively correlated pixels (blue) and the negatively correlated pixels (red) between the PCA-GradCam and PCA-Shape images. A negative correlation means that the intensity levels of the two images are inverse at those pixels; for example, if the PCA-GradCam pixels are red, the PCA-Shape pixels are blue. A positive correlation means that the two images have the same intensity orientation (both on the red scale). A pixel correlation of 1 or -1 means that the images have the same corresponding level of intensity (high or low), with either the same or the inverse orientation. A correlation with a magnitude smaller than 1 means that the intensity correlation is weaker, so the two images have different levels of pixel intensity.
The positive and negative ratios were computed as the average number of positively and negatively correlated pixels in each case. If one ratio was higher than the other, the network in that case focused on the same (positive) or the inverse (negative) orientation as in the PCA-Shape image. The larger the difference between the positive and negative ratios, the stronger the positive or negative focus of the network.
To decide whether a learning pattern was correct or wrong, the experts checked the PCA-Shape images and identified the area on which the model needed to focus, that is, whether it was the blue or the green areas of the PCA-Shape image. We then assessed the PCA-GradCam images and, using the terminology above, evaluated the positive and negative ratios. Finally, we applied the same procedure to all the images of Fig. 7 and determined the correct or wrong learning patterns of the network.
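One way to compute a pixel-wise correlation map and the positive and negative ratios is sketched below. The text does not give the exact formula, so this is an assumption: each image is standardized, the per-pixel product of the two z-scores is taken as the pixel correlation (its mean over the image equals the Pearson correlation coefficient), and each ratio is the fraction of pixels with that sign. The function names `first_pca_components` and `correlation_ratios` are hypothetical.

```python
import numpy as np


def first_pca_components(images, n_components=2):
    """PCA over a stack of equally sized 2D images; components returned as images."""
    stack = np.asarray(images, dtype=float)
    n, h, w = stack.shape
    flat = stack.reshape(n, h * w)
    flat = flat - flat.mean(axis=0)  # centre each pixel across the stack
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return vt[:n_components].reshape(-1, h, w)


def correlation_ratios(shape_img, gradcam_img):
    """Per-pixel correlation map plus the fractions of positive and negative pixels.

    Both images must be non-constant, otherwise the standard deviation is zero.
    """
    zs = (shape_img - shape_img.mean()) / shape_img.std()
    zg = (gradcam_img - gradcam_img.mean()) / gradcam_img.std()
    pixel_corr = zs * zg  # mean of this map is the Pearson coefficient
    positive_ratio = float((pixel_corr > 0).mean())
    negative_ratio = float((pixel_corr < 0).mean())
    return pixel_corr, positive_ratio, negative_ratio
```

In this sketch, a positive ratio well above the negative ratio would indicate that the network's focus follows the same orientation as the PCA-Shape image, and the reverse would indicate an inverse focus.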
Figure S1. Training and validation results of the DenseNet-121 network for patch sizes of (a) 64x64x3, (b) 32x32x3 and (c) 16x16x3. The validation signal is shown with orange lines and the training signal with blue lines. The plots indicate that the networks did not overfit during training.

Results
Figure S1 shows the training and validation results for different metrics and patch sizes (64, 32 and 16; panels a-c). These plots indicate that the networks did not overfit, as the validation (orange lines) and training (blue lines) signals are similar. The figure summarizes the results of the DenseNet-121 network. Tables S1, S2, S3 and S4 summarize the results of all the networks (Vgg-16, ResNet-50, DenseNet-121 and DenRes-131) in the validation and testing 'seen' cohorts of the multi-classification task. We extracted the manuscript's Figure S1 based on the scores in Tables S1, S2, S3 and S4. The AUC-ROC curves and the precision, recall and f1-score metrics were used to evaluate the generalization and accuracy of the networks' classification. Fig. S2 presents the AUC-ROC curves of different deep learning models (DenseNet-121 and DenRes-131) on the datasets.
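For reference, the per-class precision, recall and f1-score mentioned above can be computed from true and predicted patch labels as in the sketch below. It uses the standard one-versus-rest definitions; the function name `per_class_scores` and the toy labels are illustrative assumptions and do not reproduce any value from the tables.

```python
def per_class_scores(y_true, y_pred):
    """One-versus-rest precision, recall and F1 for every class label."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy usage with three hypothetical classes.
toy_scores = per_class_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```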